Motivation

From  Image  Processing 


2-D  Grating  Patterns 


•  Certain  2-D  geometric  patterns 
transform  to  dots  in  a  2-D 
spatial  frequency  plane* 




Time-frequency  distributions 
contain  “geometric  patterns” 
due  to  harmonic  content 


Possible  use 

Pitch  estimation 
Noise  reduction 
Multi-speaker  separation 


*From R.L. DeValois and K.K. DeValois, Spatial Vision, Oxford University Press, 1988.


Narrowband  Spectrogram 
2-D Spectrogram Model

Short-Space  2-D  Sine 


•  Harmonic  line  structure  of  the  narrowband  spectrogram  is  modeled  over  a 
small  region  by  a  2-D  sine  function  sitting  on  a  flat  pedestal  of  unity 


•  2-D  window  is  applied  to  extract  a  short-time  segment  and  2-D  Fourier 
transform  is  then  computed 


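$$x[n, m] = w[n, m]\,\bigl(1 + \cos(\omega_g m)\bigr)$$


$$X(\omega_1, \omega_2) = 2W(\omega_1, \omega_2) + W(\omega_1, \omega_2 - \omega_g) + W(\omega_1, \omega_2 + \omega_g)$$

Here W is the 2-D Fourier transform of the window w; the expression holds up to a constant scale factor.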


2-D  Impulses  are 
smeared  by  2-D 
windowing 

Distance  from 
origin  of  smeared 
impulses  varies 
inversely  with  pitch 

Angle  of  impulses 
from  vertical  axis 
varies  with  pitch 
dynamics 
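The two observations above follow from the short-space 2-D sine model. The worked relations below are a sketch under stated assumptions: the symbols \Delta f (STFT bin spacing in Hz), f_0 (pitch in Hz), P (harmonic spacing in bins), and \theta (tilt of the harmonic lines) are introduced here for illustration and do not appear on the slides.

```latex
% Harmonics are spaced f_0 Hz apart, i.e. P bins apart along the frequency index m:
P = \frac{f_0}{\Delta f}, \qquad
\omega_g = \frac{2\pi}{P} = \frac{2\pi\,\Delta f}{f_0}
\quad\Longrightarrow\quad
f_0 = \frac{2\pi\,\Delta f}{\omega_g}.

% Changing pitch tilts the harmonic lines by an angle \theta (assumed model):
x[n, m] = w[n, m]\,\bigl(1 + \cos(\omega_g (n \sin\theta + m \cos\theta))\bigr)
\quad\Longrightarrow\quad
\text{smeared impulses at } (\omega_1, \omega_2) = \pm\,\omega_g\,(\sin\theta,\ \cos\theta).
```

The radial distance of the impulses is \omega_g, which shrinks as f_0 grows, and their angle from the \omega_2 (vertical) axis is \theta. In this tilted model the spacing measured along the frequency axis is 2\pi / (\omega_g \cos\theta), which is close to 2\pi / \omega_g when \theta is small.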
2-D Processing

Example 

2-D  analysis  of  narrowband  spectrogram  of  all-voiced  female  speech 


Speech 


Spectrogram 


2-D  Fourier 
Transforms 


2-D (Hamming) window dimension: ≈ 100 ms × 1000 Hz


DC  region 
removed 


Henceforth, the 2-D mapping is referred to as the “Grating Compression Transform” (GCT) to highlight that it maps “gratings” to concentrated “dots”
2-D Processing

Example  with  Noise 


*  2-D  analysis  of  all-voiced  female  speech  in  noise 

-  GCT  without  and  with  additive  white  Gaussian  noise  at  average  SNR  of  ~3  dB 


Speech 


Clean 


2-D  Fourier 
Transforms 


Noisy


•  Energy  concentration  of  GCT  is  typically  preserved  at  roughly  the  same 
location  as  for  the  clean  case 

However,  when  noise  dominates  so  that  little  harmonic  structure  remains  within  the  2-D 
window,  energy  concentration  deteriorates,  as  in  the  vicinity  of  0.95  s  and  2000  Hz 
Pitch Estimation

GCT-Based  Approach 


•  GCT of the speech examples motivates a simple pitch estimator


-  Pitch estimate is inversely proportional to the distance from the origin to the maximum value in the GCT


frequency 


*  Pitch  rate  of  change  is  proportional  to 
angle  of  GCT  peak  from  vertical  axis 
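The estimator described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `gct` and `gct_pitch`, the `bin_hz` parameter (STFT bin spacing in Hz), the mean removal, and the zeroing of a single DC bin are all assumptions made for the example.

```python
import numpy as np

def gct(patch):
    """Magnitude of the 2-D DFT of a Hamming-windowed, mean-removed patch.

    patch[m, n]: rows index frequency bins, columns index time frames.
    """
    n_freq, n_time = patch.shape
    window = np.outer(np.hamming(n_freq), np.hamming(n_time))
    mag = np.abs(np.fft.fft2(window * (patch - patch.mean())))
    mag[0, 0] = 0.0  # assumed stand-in for "DC region removed"
    return mag

def gct_pitch(patch, bin_hz):
    """Return (pitch in Hz, angle in radians from the vertical axis)."""
    mag = gct(patch)
    k_f, k_t = np.unravel_index(np.argmax(mag), mag.shape)
    # Peak location in radians per frequency bin and per time frame.
    w_f = 2 * np.pi * np.fft.fftfreq(patch.shape[0])[k_f]
    w_t = 2 * np.pi * np.fft.fftfreq(patch.shape[1])[k_t]
    radius = np.hypot(w_f, w_t)           # distance from the origin
    angle = np.arctan2(abs(w_t), abs(w_f))
    return 2 * np.pi * bin_hz / radius, angle

# Synthetic patch: harmonics 200 Hz apart on a 10-Hz grid, constant pitch.
m = np.arange(100)[:, None]
patch = (1.0 + np.cos(2 * np.pi * m / 20.0)) * np.ones((1, 40))
pitch_hz, tilt = gct_pitch(patch, bin_hz=10.0)
```

On this stationary synthetic patch the peak lies on the vertical axis, so the angle is zero and the radial distance alone determines the pitch.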
GCT-Based Pitch Estimation

Example 


GCT-based  estimator  over  time 
-  2-D analysis window slid along the speech spectrogram at a 10-ms frame interval at a low-frequency location


[Figure: pitch contour estimates versus Time (s)]


-  Relatively robust in noise, outperforming a sinewave-based pitch estimator


GCT-Based  Estimator 


Sinewave-Based  Estimator 


MIT Lincoln Laboratory


GCT-Based  Pitch  Estimation 

Performance 


GCT-based  estimator 


-  2-D analysis window slid along the speech spectrogram at a 10-ms frame interval at a low-frequency location given by the 2-D window in the previous slide

-  Average magnitude difference measured between pitch-contour estimates with and without white Gaussian noise for both the GCT- and sinewave-based estimators


Performance Measurements

          FEMALES           MALES
          9 dB    3 dB      9 dB    3 dB
GCT       0.5     6.7       0.9     6.7
SINE      5.8     40.5      2.6     12.8

Average  magnitude  error  (in  dB)  in  GCT- 
and  sine-wave-based  pitch  contour  estimates 
for  clean  and  noisy  all-voiced  passages.  The 
two  passages  “Why  were  you  away  a  year 
Roy?”  and  “Nanny  may  know  my  meaning.” 
from  two  male  and  two  female  speakers  were 
used  under  noise  conditions  9  dB  and  3  dB 
average  signal-to-noise  ratio. 
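The evaluation protocol described above (add white Gaussian noise at a target average SNR, then compare the clean and noisy pitch contours) can be sketched as follows. The helper names are hypothetical, and the difference is returned in the units of the contours; the conversion to the dB figures reported in the table is not given on the slides and is not reproduced here.

```python
import numpy as np

def add_white_noise(signal, snr_db, rng):
    """Add white Gaussian noise scaled so the average SNR equals snr_db."""
    signal = np.asarray(signal, dtype=float)
    signal_power = np.mean(signal ** 2)
    noise = rng.standard_normal(signal.shape)
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise *= np.sqrt(target_noise_power / np.mean(noise ** 2))
    return signal + noise

def average_magnitude_difference(contour_a, contour_b):
    """Mean absolute difference between two equal-length pitch contours."""
    a = np.asarray(contour_a, dtype=float)
    b = np.asarray(contour_b, dtype=float)
    return float(np.mean(np.abs(a - b)))

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 200.0 * np.arange(8000) / 8000.0)
noisy = add_white_noise(clean, snr_db=3.0, rng=rng)
```

A pitch estimator would then be run on `clean` and `noisy`, and the two resulting contours passed to `average_magnitude_difference`.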
Pitch Estimation

Multi-resolution  properties 


•  Pitch  in  the  2-D  plane 


-  Pitch  can  be  obtained  anywhere 
in  the  2-D  plane 

-  “Wavelet-like  tiling”  of  2-D 
window  found  to  give  the  most 
consistent  estimate 

Reflects the increase in pitch FM with increasing frequency


Three  pitch  contours 


Window  A:  Dashed  Green 
Window  B:  Dashed-Dot  Red 
Window  C:  Solid  Blue 
Two-Speaker Pitch Estimation


•  Sum  of  two  speakers  has 

spectrogram  with  two  harmonic 
sets 


Blind  use  of  one-speaker  pitch 
estimator  on  two-speaker  signal 


2-D  Analysis  Window  Speaker  A 


•  GCT  gives  two  pairs  of  dots,  one 
pair  for  each  speaker 


GCT-Based Estimator

Maximum of GCT latches on to the speaker with (locally) largest energy


-  All-voiced  example  (male  +  female) 


Sine-Wave-Based  Estimator 


Formant Estimation

The  High-Pitched  Problem 


Synthesized  vowel  /ah/  with  330-Hz 
pitch.  Speech  spectrum  generated 
from  short-time  Fourier  analysis  with  a 
20-ms  Hamming  window. 


Speech Spectrum, f0 = 330 Hz, /ah/


Collection of harmonic samples from a pitch sweep ranging from 305 to 355 Hz. Contrast with f0 = 330 Hz shown in Figure 1.


Spectral Sampling with Changing f0, 305-355 Hz, /ah/
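The effect of a changing pitch on spectral sampling can be illustrated by listing where the harmonics fall. This is a sketch only: the 4000-Hz upper limit and the 5-Hz step of the sweep are assumptions chosen for the example, and `harmonic_frequencies` is a hypothetical helper.

```python
import numpy as np

def harmonic_frequencies(f0, f_max):
    """Frequencies (Hz) of the harmonics of f0 that lie at or below f_max."""
    return f0 * np.arange(1, int(f_max // f0) + 1)

# Fixed pitch: the vocal-tract envelope is sampled only every 330 Hz.
single = harmonic_frequencies(330.0, 4000.0)

# Pitch sweep 305-355 Hz: harmonics from all pitch values are pooled.
sweep = np.sort(np.concatenate(
    [harmonic_frequencies(f0, 4000.0) for f0 in np.arange(305.0, 356.0, 5.0)]
))

largest_gap_single = np.max(np.diff(single))
largest_gap_sweep = np.max(np.diff(sweep))
```

Because the k-th harmonic moves k times as far as the fundamental, the pooled samples fill in the higher-frequency regions more densely than the lower ones.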


Frequency  (Hz) 




Spectral  Estimation:  Results 

Average  Percent  Formant  Error 


[Figure: Average Percent Formant Error versus Method Number (1-4); panels: Females, Males]
Methods

1.  Single  STFT  slice 

2.  Cepstral liftering

3.  Proposed  method 

4.  Spectral  slice  averaging 

Relative gains (versus method 1) for [F1, F2, F3] via the proposed method:

-  Females:  [61%,  61%,  73%] 

-  Males:  [62%,  82%,  87%] 

Gains  for  F3  greatest  (wider 
harmonic  sampling) 

Individual  formant  scores 
across  vowels  also 
consistent 
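For reference, method 2 (cepstral liftering) is commonly implemented by keeping only the low-quefrency part of the real cepstrum. The sketch below shows that generic technique; the function name, FFT size, Hamming window, and `n_keep` cutoff are assumptions and are not taken from the slides.

```python
import numpy as np

def cepstral_envelope(frame, n_fft=1024, n_keep=20):
    """Smoothed log-magnitude spectrum obtained by low-quefrency liftering."""
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)), n_fft)
    log_mag = np.log(np.abs(spectrum) + 1e-12)
    cepstrum = np.fft.irfft(log_mag, n_fft)      # real cepstrum
    lifter = np.zeros(n_fft)
    lifter[:n_keep] = 1.0                         # low quefrencies
    lifter[n_fft - n_keep + 1:] = 1.0             # their symmetric counterparts
    return np.fft.rfft(cepstrum * lifter).real   # back to log magnitude

# 20-ms frame of a 330-Hz harmonic signal at an assumed 10-kHz sampling rate.
fs = 10_000
t = np.arange(200) / fs
frame = sum(np.cos(2 * np.pi * k * 330.0 * t) for k in range(1, 11))
envelope = cepstral_envelope(frame)
```

The returned array has one value per non-negative frequency bin (`n_fft // 2 + 1` values).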
Spectral Estimation: Results

Average  MSE 


[Figure: Average MSE; panels: Average MSE (Females), Average MSE (Males)]
Methods

1.  Single  STFT  slice 

2.  Cepstral liftering

3.  Proposed  method 

4.  Spectral  slice  averaging 

Results  are  consistent  with 
formant  frequency 
estimation 

-  Method  3  outperforms 
others  for  all  vowels 

Data  not  shown: 

-  Consistent  results  with 
children’s  formants 
Speaker Recognition: Methods


•  Data set: male and female subsets of the TIMIT corpus
•  Baseline system

-  Mel-cepstrum feature extraction with 20-ms window and 10-ms frame interval + delta features

-  Adaptive Gaussian mixture modeling

128 mixture components
Universal background model

•  System modifications in feature extraction

-  Short-time analysis using 10-ms window and 2-ms frame interval

-  Compute average of spectral slices spanning ~30 ms

-  Derived spectra are used for computing standard mel-cepstrum + deltas
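The modified front end can be sketched as a short-time analysis followed by a running average of neighbouring spectral slices. The details below are assumptions: a Hamming window, averaging of magnitude (rather than power) spectra, a 512-point FFT, and 10 slices per average (the count given in the results legend; with a 10-ms window and 2-ms interval, 10 slices span 28 ms). The function name is hypothetical.

```python
import numpy as np

def averaged_spectra(signal, fs, win_ms=10.0, hop_ms=2.0, n_avg=10, n_fft=512):
    """Magnitude STFT slices followed by a running mean over n_avg slices."""
    win_len = int(round(fs * win_ms / 1000.0))
    hop = int(round(fs * hop_ms / 1000.0))
    window = np.hamming(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    slices = np.array([
        np.abs(np.fft.rfft(window * signal[i * hop:i * hop + win_len], n_fft))
        for i in range(n_frames)
    ])
    kernel = np.ones(n_avg) / n_avg
    # Running average along time, applied independently to each frequency bin.
    return np.apply_along_axis(
        lambda track: np.convolve(track, kernel, mode="valid"), 0, slices
    )

fs = 16_000
rng = np.random.default_rng(0)
signal = rng.standard_normal(fs)          # 1 s of placeholder audio
spectra = averaged_spectra(signal, fs)
```

Each row of `spectra` would then feed the standard mel-cepstrum computation in place of a single STFT slice.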
Speaker Recognition: Results


*  Equal  error  rate  (EER) 


*  Absolute  EER  reduction  in 
females  2.26%  (not  yet 
significant) 


[Figure: horizontal axis: False Alarm probability (in %)]
           Baseline (EER)             Proposed (EER)
Males      1.86% < 2.45% < 3.39%      1.53% < 2.15% < 2.80%
Females    3.12% < 4.41% < 5.64%      1.55% < 2.15% < 3.30%
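The equal error rate is the operating point at which the miss rate equals the false alarm rate. A minimal way to compute it from verification scores is sketched below; the scoring setup and the function name are assumptions, since the slides report only the resulting EER values.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """EER from two score lists (higher score = more likely the target)."""
    target_scores = np.asarray(target_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    # Accept a trial when its score is at or above the threshold.
    miss = np.array([np.mean(target_scores < t) for t in thresholds])
    false_alarm = np.array([np.mean(impostor_scores >= t) for t in thresholds])
    i = np.argmin(np.abs(miss - false_alarm))
    return 0.5 * (miss[i] + false_alarm[i])

eer = equal_error_rate([2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 2.5, 1.5])
```

With only a handful of trials the two error rates may not cross exactly, which is why the sketch averages them at the closest threshold.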

TIMIT 128-mix GMM

Legend:
-  Females (20-ms window, 10-ms frame)
-  Females (10-ms window, 2-ms frame, 10-slice average)
-  Males (20-ms window, 10-ms frame)
-  Males (10-ms window, 2-ms frame, 10-slice average)
Limitation of the Spectrogram

Observations 


Two  curious  effects  are  seen: 

-  Frequency tracks moving in the wrong direction, e.g., up rather than down, and

-  Crossing  tracks,  i.e.,  tracks 
moving  up  and  down 
simultaneously. 

The  problem  is  that  the  basis 
functions  of  the  Fourier 
transform,  stationary  sinusoids, 
cannot  resolve  the  speech 
harmonics  which  have  rapid 
frequency  modulation  and  are 
closely  spaced  in  frequency. 

-  This  lack  of  resolution  leads  to 
the  complex  line  phenomena 
seen  in  Figure  2. 


Speech 


Narrowband  Spectrogram 




Time  (sec) 
Spectral Sensitivity*

Example 

Example:  One-sample  shift  (0.1  ms) 


•  Harmonic  speech  spectra  can 
be  quite  sensitive  to 
aberrations  in  periodicity  of 
the  glottal  source 

•  Even  small  perturbations  can 
lead  to  short-time  spectral 
changes  that  mislead  the 
viewer  in  terms  of  signal 
composition 
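The sensitivity described above can be reproduced with a toy pulse train in which a single pulse is delayed by one sample. Everything specific here is an assumption for illustration: the 10-kHz sampling rate (so that one sample is 0.1 ms), the 100-Hz pulse rate, the Hamming window, and the choice of which pulse to move.

```python
import numpy as np

fs = 10_000                       # assumed sampling rate: one sample = 0.1 ms
n = 1000                          # 100-ms analysis segment
original = np.zeros(n)
original[::100] = 1.0             # assumed 100-Hz pulse train
modified = original.copy()
modified[500] = 0.0               # delay one pulse by a single sample
modified[501] = 1.0

window = np.hamming(n)
freqs = np.fft.rfftfreq(n, d=1.0 / fs)
spec_orig = np.fft.rfft(window * original)
spec_mod = np.fft.rfft(window * modified)

# Magnitude of the spectral change at each frequency.
change = np.abs(spec_mod - spec_orig)
low_band_change = change[:100].mean()     # 0 to 1 kHz
high_band_change = change[400:].mean()    # 4 to 5 kHz
```

The change grows with frequency because a one-sample delay corresponds to a phase shift proportional to frequency, so the upper harmonics are disturbed far more than the lower ones.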


[Figure: input signal, original (blue) and modified (red), versus Time (s); spectrum, original (blue) and modified (red), versus Frequency (Hz)]


*We have developed formal spectral representations for these sorts of effects. To be published in January 2008, IEEE TSLP: “Spectral representations of nonmodal phonation,” Malyska and Quatieri.
The one-sample shift in the time domain appears to move the harmonics at the higher frequencies
An Alternate Transform

The  Fan-Chirp  Transform 


•  The  Fan-Chirp  Transform  (FChT) 

-  “Adaptive  Chirp-Based  Time-Frequency  Analysis  of  Speech 
Signals” 

Marian Kepesi and Luis Weruaga, Speech Communication, vol.
48,  no.  5,  pp.  474-492,  May  2006. 

-  “The  Fan-Chirp  Transform  for  Non-Stationary  Harmonic 
Signals” 

Luis Weruaga and Marian Kepesi (submitted to Elsevier)

•  FChT  is  a  generalization  of  the  Fourier  transform 

-  Fourier  transform  basis  functions  are  stationary  sine  waves 

-  FChT  basis  functions  are  sine  waves  with  linear  frequency 
modulation 

-  the  set  of  basis  functions  has  a  fan  geometry 

-  1st  order  match  to  harmonic  frequency  modulation  in  speech 
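A direct-summation sketch of a fan-chirp analysis of one frame is given below. It follows the description above (sine-wave basis functions with linear frequency modulation, arranged in a fan), but the specific time warping phi(t) = (1 + alpha*t/2)*t, the square-root amplitude weighting, and the function name are assumptions of this sketch; the cited papers should be consulted for the authors' exact definition.

```python
import numpy as np

def fan_chirp_transform(x, fs, freqs, alpha):
    """Fan-chirp analysis of a frame centred at t = 0 (alpha = 0 gives Fourier)."""
    t = (np.arange(len(x)) - (len(x) - 1) / 2.0) / fs
    phi = (1.0 + 0.5 * alpha * t) * t      # assumed warped time
    dphi = 1.0 + alpha * t                 # frequency scaling; must stay > 0
    kernel = np.exp(-2j * np.pi * np.outer(freqs, phi))
    return kernel @ (x * np.sqrt(np.abs(dphi))) / fs

# Harmonic test frame whose pitch rises linearly (chirp rate alpha = 10 per s).
fs, n, alpha = 8000, 320, 10.0
t = (np.arange(n) - (n - 1) / 2.0) / fs
phi = (1.0 + 0.5 * alpha * t) * t
frame = np.hamming(n) * sum(np.cos(2 * np.pi * k * 200.0 * phi) for k in range(1, 11))

freqs = np.array([2000.0])                 # tenth harmonic at the frame centre
matched = abs(fan_chirp_transform(frame, fs, freqs, alpha)[0])
fourier = abs(fan_chirp_transform(frame, fs, freqs, 0.0)[0])
```

In this synthetic case the matched chirp rate concentrates the tenth harmonic, while the stationary Fourier basis spreads it over the band that the harmonic sweeps through during the frame.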
Spectrogram Comparison

STFT  versus  Short-Time  Fan-Chirp 


Observations 

-  FChT  resolves  high  frequency  harmonics  even  when 
frequency  modulation  is  large 

-  Frequency  tracks  appear  as  predicted  for  FChT 


[Figure: spectrograms versus Time (sec); panels: Short-Time Fourier Transform, Short-Time Fan-Chirp Transform]
Fan-Chirp for Grating Compression Transform

Pitch  Estimation 

•  Based on preliminary results, the short-time fan-chirp transform (STFChT) appears to outperform the STFT for pitch estimation in noise

STFChT 

STFT 
Conclusions and Directions


•  The  grating  compression  transform  (GCT)  maps  harmonically-related  signal 
components  to  a  concentrated  entity  in  a  spatial  2-D  frequency  plane 

•  The  GCT  forms  the  basis  of  a  pitch  estimator  that  uses  the  radial  distance  to  the 
largest  peak  of  the  GCT 

-  The  resulting  pitch  estimator  appears  robust  under  noise  conditions  and  amenable  to 
extension  to  two-speaker  pitch  estimation 

•  The  GCT  forms  the  basis  of  a  formant  estimator  that  exploits  separability  of  speech 
source  and  vocal  tract  information  via  changing  pitch 

•  Although  the  spectrogram  provides  a  useful  starting  point  for  the  GCT,  alternate 
transforms  can  provide  improved  performance 

-  Fan-chirp transform is one possibility

•  Possible GCT directions

-  Alternate time-frequency distributions
-  Pitch estimation: extended evaluation on a larger corpus and use of voiced/unvoiced speech
-  Two-speaker pitch estimation
-  Formant estimation in noise
-  GCT as a model of auditory cortical processing (Shamma, Ezzat, and Poggio)